Abstract

Deep structured output learning shows great promise in tasks like semantic image
segmentation. We propose a new, efficient deep structured model learning scheme, in
which we show how deep Convolutional Neural Networks (CNNs) can be used to directly
estimate the messages in message passing inference for structured prediction with
Conditional Random Fields (CRFs). With such CNN message estimators, we obviate the
need to learn or evaluate potential functions for message calculation. This confers
significant efficiency for learning, since otherwise, when performing structured
learning for a CRF with CNN potentials, it is necessary to undertake expensive
inference for every stochastic gradient iteration. The network output dimension of
message estimators is the same as the number of classes, rather than growing
exponentially in the order of the potentials. Hence it is more scalable when a large
number of classes is involved. We apply our method to semantic image segmentation and
achieve impressive performance, which demonstrates the effectiveness and usefulness
of our CNN message learning method.


Appearing in Proc. The Twenty-ninth Annual Conference on Neural Information 
Processing Systems (NIPS), 2015, Montreal, Canada. 

This work was in part supported by Data to Decisions CRC Centre. The authors would like to 
thank NVIDIA for the donations of the K40 graphic cards. 

Correspondence should be addressed to C. Shen. 






Contents 


1 Introduction
    1.1 Related work
2 Learning CRF with CNN potentials
3 Learning CNN message estimators
    3.1 CNN message estimators
    3.2 Details for message estimator networks
    3.3 Training CNN message estimators
    3.4 Message learning with inference-time budgets
4 Experiments
5 Conclusion









1 Introduction 


Learning deep structured models has attracted considerable research attention recently.
One popular approach to deep structured models is to formulate conditional random fields
(CRFs) using deep Convolutional Neural Networks (CNNs) for the potential functions. This
combines the power of CNNs for feature representation learning with the ability of CRFs
to model complex relations. The typical approach for the joint learning of CRFs and CNNs
(e.g., [5]) is to learn the CNN potential functions by optimizing the CRF objective,
e.g., maximizing the log-likelihood. Joint learning of CNNs and CRFs has shown impressive
performance for semantic image segmentation.

For the joint learning of CNNs and CRFs, stochastic gradient descent (SGD) is typically
applied for optimizing the conditional likelihood. This approach requires marginal
inference for calculating the gradient. For loopy graphs, marginal inference is generally
expensive even when using approximate solutions. Given that learning the CNN potential
functions typically requires a large number of gradient iterations, repeated marginal
inference would make the training intractably slow. Applying an approximate training
objective is one way to avoid repeated inference; pseudo-likelihood learning and
piecewise learning are examples of this kind of approach. In this work, we advocate a new
direction for efficient deep structured model learning.

In conventional CRF approaches, the final prediction is the result of inference based on
the learned potentials. However, our ultimate goal is the final prediction (not the
potentials themselves), so we propose to directly optimize the inference procedure for
the final prediction. Our focus here is on the extensively studied message passing based
inference algorithms. As discussed in [5], we can directly learn message estimators to
output the required messages in the inference procedure, rather than learning the
potential functions as in conventional CRF learning approaches. With the learned message
estimators, we then obtain the final prediction by performing message passing inference.

Our main contributions are as follows. 

• We explore a new direction for efficient deep structured learning. We propose to
directly learn the messages in message passing inference by training deep CNNs in
an end-to-end fashion. Message learning does not require any inference step for the
gradient calculation, which allows efficient training, and it can be cast as a
traditional classification problem.

• The network output dimension for message estimation is the same as the number of
classes ($K$), while the network output for general CNN potential functions in CRFs
is $K^a$, which is exponential in the order ($a$) of the potentials (for example,
$a = 2$ for pairwise potentials, $a = 3$ for triple-cliques, etc.). Hence CNN based
message learning has significantly fewer network parameters and thus is more
scalable, especially when a large number of classes is involved.

• The number of iterations in message passing inference can be explicitly taken into
consideration in the message learning procedure. In this paper, we are particularly
interested in learning messages that offer high-quality CRF prediction results with
only one message passing iteration, making the message passing inference very fast.

• We apply our method to semantic image segmentation on the PASCAL VOC 2012
dataset and achieve impressive performance.

1.1 Related work 

Combining the strengths of CNNs and CRFs for segmentation has been explored in several
recent methods. Some methods resort to a simple combination of CNN classifiers and CRFs
without joint learning. DeepLab-CRF first trains a fully convolutional CNN for pixel
classification and then applies a dense CRF [10] method as a post-processing step. A
later method extends DeepLab by jointly learning the dense CRFs and CNNs. RNN-CRF in [1]
also performs joint learning of CNNs and dense CRFs. It implements mean-field inference
as a Recurrent Neural Network, which facilitates end-to-end learning. These methods
usually use CNNs for modelling the unary potentials only. The work in [3] trains CNNs to
model both the unary and pairwise potentials in order to capture contextual information.
Jointly learning CNNs and CRFs has also been explored for other applications such as
depth estimation. Other work explores joint training of Markov random fields and deep
networks for predicting words from noisy images and for image classification.

All of the above-mentioned methods that combine CNNs and CRFs are based upon conventional
CRF approaches. They aim to jointly learn or incorporate pre-trained CNN potential
functions, and then perform inference/prediction using the potentials. In contrast, our
method directly learns CNN message estimators for the message passing inference,
rather than learning the potentials.

The inference machine proposed in [5] is relevant to our work in that it discusses
the idea of directly learning message estimators instead of learning potential functions
for structured prediction. That work trains traditional logistic regressors with
hand-crafted features as message estimators. Motivated by the tremendous success of CNNs,
we propose to train deep CNN based message estimators in an end-to-end learning style
without using hand-crafted features. Unlike the approach in [5], which aims to learn
variable-to-factor message estimators, our proposed method aims to learn factor-to-variable
message estimators. Thus we are able to naturally formulate the variable marginals, which
are the ultimate goal of CRF inference, as the training objective (see Sec. 3.3). Another
approach jointly learns CNNs and CRFs for pose estimation; it learns the marginal
likelihood of body parts but ignores the partition function in the likelihood. Message
learning is not discussed in that work, and the exact relation between this pose
estimation approach and message learning remains unclear.


2 Learning CRF with CNN potentials 


Before describing our message learning method, we review the CRF-CNN joint learning
approach and discuss its limitations. An input image is denoted by $x \in \mathcal{X}$ and
the corresponding labeling mask is denoted by $y \in \mathcal{Y}$. The energy function is
denoted by $E(y, x)$, which measures the score of the prediction $y$ given the input image
$x$. We consider the following form of conditional likelihood:


\[
P(y \mid x) = \frac{1}{Z(x)} \exp[-E(y, x)]
            = \frac{\exp[-E(y, x)]}{\sum_{y'} \exp[-E(y', x)]}. \tag{1}
\]


Here $Z$ is the partition function. The CRF model is decomposed by a factor graph over a
set of factors $\mathcal{F}$. Generally, the energy function is written as a sum of
potential functions (factor functions):

\[
E(y, x) = \sum_{F \in \mathcal{F}} E_F(y_F, x_F). \tag{2}
\]

Here $F$ indexes one factor in the factor graph; $y_F$ denotes the variable nodes which are
connected to the factor $F$; $E_F$ is the (log-) potential function (factor function). The
potential function can be a unary, pairwise, or high-order potential function. The recent
method in [3] describes examples of constructing general CNN based unary and pairwise
potentials.
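
To make Eqs. (1) and (2) concrete, the following is a minimal sketch on a toy three-node
chain. The potential values, the Potts-style pairwise term and the function names are
illustrative assumptions, not part of the method described here; in the CRF-CNN setting
the potentials would be produced by CNNs. The partition function is computed by brute-force
enumeration, which is only feasible for such tiny graphs.

```python
import itertools
import math

K = 3  # number of classes (illustrative)

# Hypothetical unary energies for a 3-node chain p0 - p1 - p2.
unary = [[0.2, 1.0, 1.5], [1.2, 0.3, 0.9], [0.8, 0.7, 0.1]]


def pairwise(a, b):
    """Potts-style pairwise energy: 0 if the labels agree, 0.5 otherwise."""
    return 0.0 if a == b else 0.5


def energy(y):
    """E(y, x) as a sum over unary and pairwise factors, as in Eq. (2)."""
    e = sum(unary[p][y[p]] for p in range(3))
    e += pairwise(y[0], y[1]) + pairwise(y[1], y[2])
    return e


def likelihood(y):
    """P(y | x) = exp(-E(y, x)) / Z(x), as in Eq. (1); Z is found by enumeration."""
    z = sum(math.exp(-energy(yy)) for yy in itertools.product(range(K), repeat=3))
    return math.exp(-energy(y)) / z
```

Summing `likelihood` over all $K^3$ labelings gives one, and low-energy labelings receive
higher probability.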


Take semantic image segmentation as an example. To predict the pixel labels of a test image,
we can find the mode of the joint label distribution by solving the maximum a posteriori
(MAP) inference problem: $y^* = \arg\max_y P(y \mid x)$. We can also obtain the final
prediction by calculating the label marginal distribution of each variable, which requires
solving a marginal inference problem:

\[
\forall p \in \mathcal{N}: \quad P(y_p \mid x) = \sum_{y \setminus y_p} P(y \mid x). \tag{3}
\]

Here $y \setminus y_p$ indicates the output variables $y$ excluding $y_p$. For a general CRF
graph with cycles, the above inference problems are known to be NP-hard, thus approximate
inference algorithms are applied. Message passing is a widely applied family of algorithms
for approximate inference: loopy belief propagation (BP) [13], tree-reweighted message
passing [14] and mean-field approximation are examples of message passing methods.
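
The two inference problems can be illustrated by exhaustive enumeration on a toy model. The
sketch below is not a practical algorithm (its cost is $K^{|\mathcal{N}|}$, which is exactly
why approximate inference is needed); the helper name `brute_force_inference` and the toy
energy are assumptions made for illustration.

```python
import itertools
import math


def brute_force_inference(energy, num_nodes, K):
    """Return (MAP labeling, per-node marginals) by enumerating all K**num_nodes labelings."""
    labelings = list(itertools.product(range(K), repeat=num_nodes))
    weights = [math.exp(-energy(y)) for y in labelings]
    z = sum(weights)
    marginals = [[0.0] * K for _ in range(num_nodes)]
    for y, w in zip(labelings, weights):
        for p, label in enumerate(y):
            marginals[p][label] += w / z      # Eq. (3): sum out all other variables
    best = max(range(len(labelings)), key=weights.__getitem__)
    return labelings[best], marginals


def toy_energy(y):
    """Two binary nodes that prefer label 1 and prefer to agree with each other."""
    unary = sum(0.0 if label == 1 else 1.0 for label in y)
    return unary + (0.0 if y[0] == y[1] else 2.0)


y_map, marginals = brute_force_inference(toy_energy, num_nodes=2, K=2)
```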

CRF-CNN joint learning learns CNN potential functions by optimizing the CRF objective,
typically the negative conditional log-likelihood:

\[
-\log P(y \mid x; \theta) = E(y, x; \theta) + \log Z(x; \theta). \tag{4}
\]

The energy function $E(y, x)$ is constructed by CNNs, for which all the network parameters
are denoted by $\theta$. Adding regularization, minimizing the negative log-likelihood for
CRF learning is:

\[
\min_{\theta} \; \frac{\lambda}{2} \|\theta\|_2^2
  + \sum_{i=1}^{N} \left[ E(y^{(i)}, x^{(i)}; \theta) + \log Z(x^{(i)}; \theta) \right]. \tag{5}
\]

Here $x^{(i)}$, $y^{(i)}$ denote the $i$-th training image and its segmentation mask; $N$ is
the number of training images; $\lambda$ is the weight decay parameter. We can apply
stochastic gradient descent (SGD) to optimize the above problem for learning $\theta$. The
energy function $E(y, x; \theta)$ is constructed from CNNs, and its gradient
$\nabla_\theta E(y, x; \theta)$ can be easily computed by applying the chain rule as in
conventional CNNs. However, the partition function $Z$ brings difficulties for
optimization. Its gradient is:

\[
\nabla_\theta \log Z(x; \theta)
  = \sum_{y} \frac{\exp[-E(y, x; \theta)]}{\sum_{y'} \exp[-E(y', x; \theta)]}
    \nabla_\theta [-E(y, x; \theta)]
  = -\sum_{y} P(y \mid x; \theta) \, \nabla_\theta E(y, x; \theta). \tag{6}
\]


Direct calculation of the above gradient is computationally infeasible for general CRF
graphs. Usually it is necessary to perform approximate marginal inference to calculate
the gradient at each SGD iteration [13]. However, repeated marginal inference can be
extremely expensive, as discussed in [3]. CNN training usually requires a huge number of
SGD iterations (hundreds of thousands, or even millions), hence this inference based
learning approach in general scales poorly and can even be infeasible.
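
The identity in Eq. (6) can be checked numerically on a model small enough to enumerate.
The sketch below assumes a toy energy that is linear in a two-dimensional parameter vector
(so that $\nabla_\theta E$ is just a feature vector); this linear form is an assumption for
illustration only, since in the setting above $E$ is a CNN. The sum over all labelings in
`grad_log_z` is the step that becomes intractable for real graphs.

```python
import itertools
import math

LABELINGS = list(itertools.product(range(2), repeat=2))  # two binary nodes


def features(y):
    """Gradient of the toy energy E(y; theta) = theta . features(y) w.r.t. theta."""
    return [float(y[0] != y[1]), float(y[0] + y[1])]


def energy(y, theta):
    return sum(t * f for t, f in zip(theta, features(y)))


def log_z(theta):
    return math.log(sum(math.exp(-energy(y, theta)) for y in LABELINGS))


def grad_log_z(theta):
    """Eq. (6): grad log Z = -sum_y P(y | x; theta) * grad E(y, x; theta)."""
    z = math.exp(log_z(theta))
    grad = [0.0, 0.0]
    for y in LABELINGS:
        prob = math.exp(-energy(y, theta)) / z
        for i, f in enumerate(features(y)):
            grad[i] -= prob * f
    return grad


theta = [0.7, -0.3]
analytic = grad_log_z(theta)

# Central finite differences as an independent check of the analytic gradient.
eps = 1e-6
numeric = [
    (log_z([theta[0] + eps, theta[1]]) - log_z([theta[0] - eps, theta[1]])) / (2 * eps),
    (log_z([theta[0], theta[1] + eps]) - log_z([theta[0], theta[1] - eps])) / (2 * eps),
]
```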


3 Learning CNN message estimators 


In conventional CRF approaches, the potential functions are first learned, and then
inference is performed based on the learned potential functions in order to generate the
final prediction. In contrast, our approach directly optimizes the inference procedure for
the final prediction. We propose to learn CNN estimators to directly output the required
intermediate values in an inference algorithm.


Here we focus on message passing based inference algorithms, which have been extensively
studied and widely applied. In the CRF prediction procedure, the “message” vectors are
recursively calculated based on the learned potentials. We propose to construct and learn
CNNs to directly estimate these messages in the message passing procedure, rather than
learning the potential functions. In particular, we directly learn factor-to-variable
message estimators. Our message learning framework is general and can accommodate all
message passing based algorithms such as loopy belief propagation (BP) [13], mean-field
approximation [13] and their variants. Here we discuss using loopy BP for calculating
variable marginals. As shown by Yedidia et al., loopy BP has a close relation with the
Bethe free energy approximation.


Typically, the message is a $K$-dimensional vector ($K$ is the number of classes) which
encodes the information of the label distribution. For each variable-factor connection, we
need to recursively compute the variable-to-factor message $\beta_{p \to F} \in \mathbb{R}^K$
and the factor-to-variable message $\beta_{F \to p} \in \mathbb{R}^K$. For numerical
reasons, the log operation is applied to the marginals before deriving the message passing
algorithms. The unnormalized variable-to-factor message is computed as:

\[
\bar{\beta}_{p \to F}(y_p) = \sum_{F' \in \mathcal{F}_p \setminus F} \beta_{F' \to p}(y_p). \tag{7}
\]

Here $\mathcal{F}_p$ is the set of factors connected to the variable $p$;
$\mathcal{F}_p \setminus F$ is the set of factors $\mathcal{F}_p$ excluding the factor $F$.
For loopy graphs, the variable-to-factor message is normalized in each iteration:

\[
\beta_{p \to F}(y_p) = \log \frac{\exp \bar{\beta}_{p \to F}(y_p)}
                                 {\sum_{y'_p} \exp \bar{\beta}_{p \to F}(y'_p)}. \tag{8}
\]
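
Eqs. (7) and (8) amount to summing log-domain messages and then applying a log-softmax. The
following minimal sketch illustrates this; the dictionary-based interface and the factor
names are assumptions made for the example.

```python
import math


def log_normalize(v):
    """Eq. (8): subtract the log-sum-exp so that exp(v) sums to one."""
    m = max(v)
    lse = m + math.log(sum(math.exp(x - m) for x in v))
    return [x - lse for x in v]


def variable_to_factor(incoming, exclude):
    """Eqs. (7)-(8): sum the factor-to-variable messages arriving at a node from
    every factor except `exclude`, then log-normalize the result.

    incoming: dict mapping a factor name to its K-dimensional log-domain message.
    """
    K = len(next(iter(incoming.values())))
    summed = [0.0] * K
    for name, msg in incoming.items():
        if name != exclude:
            summed = [s + m for s, m in zip(summed, msg)]
    return log_normalize(summed)


incoming = {
    "unary": [-0.1, -2.0, -3.0],
    "F_left": [-1.0, -0.5, -2.0],
    "F_right": [-0.3, -0.3, -4.0],
}
msg = variable_to_factor(incoming, exclude="F_right")
```

Subtracting the maximum before exponentiating is a standard numerical safeguard and does not
change the result.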
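
The factor-to-variable message is computed as:

\[
\beta_{F \to p}(y_p) = \log \sum_{y'_F \setminus y'_p,\; y'_p = y_p}
  \exp \Big[ -E_F(y'_F) + \sum_{q \in \mathcal{N}_F \setminus p} \beta_{q \to F}(y'_q) \Big]. \tag{9}
\]

Here $\mathcal{N}_F$ is the set of variables connected to the factor $F$;
$\mathcal{N}_F \setminus p$ is the set of variables $\mathcal{N}_F$ excluding the variable
$p$. Once we get all the factor-to-variable messages of one variable node, we are able to
calculate the marginal distribution (beliefs) of that variable:

\[
P(y_p \mid x) = \sum_{y \setminus y_p} P(y \mid x)
  = \frac{1}{Z_p} \exp \Big[ \sum_{F \in \mathcal{F}_p} \beta_{F \to p}(y_p) \Big], \tag{10}
\]

in which $Z_p$ is a normalizer:
$Z_p = \sum_{y_p} \exp \big[ \sum_{F \in \mathcal{F}_p} \beta_{F \to p}(y_p) \big]$.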
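
As a sanity check of Eqs. (7)-(10), the sketch below runs them on two nodes $p$ and $q$
joined by one pairwise factor, each with a unary factor. On such a tree, belief propagation
is exact, so the belief of $p$ should match the brute-force marginal. The potential values
are made up for illustration.

```python
import math

K = 3
U_p = [0.2, 1.0, 1.5]   # hypothetical unary energies of node p
U_q = [1.2, 0.3, 0.9]   # hypothetical unary energies of node q


def E_F(yp, yq):
    """Hypothetical pairwise energy of the factor F joining p and q."""
    return 0.0 if yp == yq else 0.8


def logsumexp(v):
    m = max(v)
    return m + math.log(sum(math.exp(x - m) for x in v))


# Unary factor-to-variable messages: Eq. (9) with no other variable on the factor.
beta_Up = [-e for e in U_p]
beta_Uq = [-e for e in U_q]

# Variable-to-factor message q -> F: Eqs. (7)-(8), only the unary message comes in.
beta_qF = [b - logsumexp(beta_Uq) for b in beta_Uq]

# Factor-to-variable message F -> p: Eq. (9).
beta_Fp = [
    logsumexp([-E_F(yp, yq) + beta_qF[yq] for yq in range(K)]) for yp in range(K)
]

# Belief of p: Eq. (10).
scores = [u + f for u, f in zip(beta_Up, beta_Fp)]
belief_p = [math.exp(s - logsumexp(scores)) for s in scores]

# Brute-force marginal of p for comparison.
joint = [[math.exp(-(U_p[a] + U_q[b] + E_F(a, b))) for b in range(K)] for a in range(K)]
z = sum(map(sum, joint))
exact_p = [sum(row) / z for row in joint]
```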


3.1 CNN message estimators 


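The calculation of the factor-to-variable message $\beta_{F \to p}$ depends on the
variable-to-factor messages $\beta_{p \to F}$. Substituting the definition of
$\beta_{p \to F}$ in (8), $\beta_{F \to p}$ can be re-written as:

\[
\begin{aligned}
\beta_{F \to p}(y_p)
 &= \log \sum_{y'_F \setminus y'_p,\; y'_p = y_p}
    \exp \Big\{ -E_F(y'_F) + \sum_{q \in \mathcal{N}_F \setminus p}
      \log \frac{\exp \bar{\beta}_{q \to F}(y'_q)}
                {\sum_{y''_q} \exp \bar{\beta}_{q \to F}(y''_q)} \Big\} \\
 &= \log \sum_{y'_F \setminus y'_p,\; y'_p = y_p}
    \exp \Big\{ -E_F(y'_F) + \sum_{q \in \mathcal{N}_F \setminus p}
      \log \frac{\exp \sum_{F' \in \mathcal{F}_q \setminus F} \beta_{F' \to q}(y'_q)}
                {\sum_{y''_q} \exp \sum_{F' \in \mathcal{F}_q \setminus F} \beta_{F' \to q}(y''_q)} \Big\}.
\end{aligned} \tag{11}
\]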


Here $q$ denotes a variable node which is connected to the node $p$ by the factor $F$ in
the factor graph. We refer to the variable node $q$ as a neighboring node of $p$.
$\mathcal{N}_F \setminus p$ is the set of variables connected to the factor $F$ excluding
the node $p$. Clearly, for a pairwise factor, which connects only two variables, the set
$\mathcal{N}_F \setminus p$ contains only one variable node. The above equations show that
the factor-to-variable message $\beta_{F \to p}$ depends on the potential $E_F$ and on
$\beta_{F' \to q}$. Here $\beta_{F' \to q}$ is the factor-to-variable message which is
calculated from a neighboring node $q$ and a factor $F' \neq F$.


Conventional CRF learning approaches learn the potential functions and then follow the above
equations to compute the messages for calculating marginals. As discussed in [8], given that
the goal is to estimate the marginals, it is not necessary to exactly follow the above
equations, which involve learning potential functions, to calculate messages. We can
directly learn message estimators, rather than indirectly learning the potential functions
as in conventional methods.


Consider the calculation in (11). The message $\beta_{F \to p}$ depends on the observation
$x_{pF}$ and the messages $\beta_{F' \to q}$. Here $x_{pF}$ denotes the observations that
correspond to the node $p$ and the factor $F$. We are able to formulate a factor-to-variable
message estimator which takes $x_{pF}$ and $\beta_{F' \to q}$ as inputs and outputs the
message vector, and we directly learn such estimators. Since one message $\beta_{F \to p}$
depends on a number of previous messages $\beta_{F' \to q}$, we can formulate a sequence of
message estimators to model the dependence. Thus the output from a previous message
estimator will be the input of the following message estimator.

There are two message passing strategies for loopy BP: synchronous and asynchronous 
passing. We here focus on the synchronous message passing, for which all messages are 
computed before passing them to the neighbors. The synchronous passing strategy results 
in much simpler message dependences than the asynchronous strategy, which simplifies the 
training procedure. We define one inference iteration as one pass of the graph with the 
synchronous passing strategy. 

We propose to learn CNN based factor-to-variable message estimators. The message estimator
models the interaction between neighboring variable nodes. We denote by $M$ a message
estimator. The factor-to-variable message is calculated as:

\[
\beta_{F \to p}(y_p) = M_F(x_{pF}, d_{pF}, y_p). \tag{12}
\]

We refer to $d_{pF}$ as the dependent message feature vector, which encodes all dependent
messages from the neighboring nodes that are connected to the node $p$ by $F$. Note that the
dependent messages are the outputs of message estimators at the previous inference
iteration. In the case of running only one message passing iteration, there are no dependent
messages for $M_F$, and thus we do not need to incorporate $d_{pF}$. To give a general
exposition, we here describe the case of running arbitrarily many inference iterations.


We can choose any effective strategy to generate the feature vector $d_{pF}$ from the
dependent messages. Here we discuss a simple example. According to (11), we define the
feature vector $d_{pF}$ as a $K$-dimensional vector which aggregates all dependent messages.
In this case, $d_{pF}$ is computed as:

\[
d_{pF}(y) = \sum_{q \in \mathcal{N}_F \setminus p}
  \log \frac{\exp \sum_{F' \in \mathcal{F}_q \setminus F} M_{F'}(x_{qF'}, d_{qF'}, y)}
            {\sum_{y'} \exp \sum_{F' \in \mathcal{F}_q \setminus F} M_{F'}(x_{qF'}, d_{qF'}, y')}. \tag{13}
\]

With the definition of $d_{pF}$ in (13) and $\beta_{F \to p}$ in (12), it is clear that the
message estimation requires evaluating a sequence of message estimators. Another example is
to concatenate all dependent messages to construct the feature vector $d_{pF}$.
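
The aggregation in Eq. (13) can be sketched as follows. The nested-list layout of the
previous-iteration estimator outputs and the function name are assumptions for
illustration; the numbers are made up.

```python
import math


def log_softmax(v):
    m = max(v)
    lse = m + math.log(sum(math.exp(x - m) for x in v))
    return [x - lse for x in v]


def dependent_message_feature(neighbor_messages):
    """Eq. (13): aggregate the previous-iteration messages into a K-dimensional d_pF.

    neighbor_messages[i] lists, for the i-th neighbor q in N_F \\ p, the K-dimensional
    estimator outputs M_{F'}(x_{qF'}, d_{qF'}, .) of all factors F' != F connected to q.
    """
    K = len(neighbor_messages[0][0])
    d = [0.0] * K
    for msgs_to_q in neighbor_messages:
        summed = [sum(m[k] for m in msgs_to_q) for k in range(K)]
        d = [a + b for a, b in zip(d, log_softmax(summed))]
    return d


# Pairwise factor: a single neighbor q that received two messages in the last iteration.
d_pF = dependent_message_feature([[[0.5, -1.0, 0.0], [1.5, 0.0, -0.5]]])
```

For a pairwise factor there is a single neighbor, so `d_pF` is simply the log of a
normalized distribution over the $K$ classes.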

There are different strategies to formulate the message estimators in different iterations.
The simplest strategy is to use the same message estimator across all inference iterations.
In this case the message estimator becomes a recursive function, and thus the CNN based
estimator becomes a recurrent neural network (RNN). Another strategy is to formulate
a different estimator for each inference iteration.


3.2 Details for message estimator networks 

We formulate the estimator $M_F$ as a CNN, thus the estimation is the network output:

\[
\beta_{F \to p}(y_p) = M_F(x_{pF}, d_{pF}, y_p; \theta_F)
  = \sum_{k=1}^{K} \delta(k = y_p) \, z_{pF,k}(x, d_{pF}; \theta_F). \tag{14}
\]

Here $\theta_F$ denotes the network parameters which we need to learn. $\delta(\cdot)$ is
the indicator function, which equals 1 if the input is true and 0 otherwise. We denote by
$z_{pF} \in \mathbb{R}^K$ the $K$-dimensional output vector ($K$ is the number of classes)
of the message estimator network for the node $p$ and the factor $F$; $z_{pF,k}$ is the
$k$-th value in the network output $z_{pF}$, corresponding to the $k$-th class.

We can consider any possible strategy for implementing $z_{pF}$ with CNNs. For example, we
here describe a strategy which is analogous to the network design in [3]. We denote by
$C^{(1)}$ a fully convolutional network (FCNN) for convolutional feature generation, and by
$C^{(2)}$ a traditional fully connected network for message estimation.

Given an input image $x$, the network output $C^{(1)}(x) \in \mathbb{R}^{N_1 \times N_2 \times r}$
is a convolutional feature map, in which $N_1 \times N_2 = N$ is the feature map size and
$r$ is the dimension of one feature vector. Each spatial position (each feature vector) in
the feature map $C^{(1)}(x)$ corresponds to one variable node in the CRF graph. We denote by
$C^{(1)}(x, p) \in \mathbb{R}^r$ the feature vector corresponding to the variable node $p$.
Likewise, $C^{(1)}(x, \mathcal{N}_F \setminus p) \in \mathbb{R}^r$ is the average of the
feature vectors that correspond to the set of nodes $\mathcal{N}_F \setminus p$. Recall that
$\mathcal{N}_F \setminus p$ is the set of nodes connected by the factor $F$ excluding the
node $p$. For pairwise factors, $\mathcal{N}_F \setminus p$ contains only one node.

We construct the feature vector $z^{(1)}_{pF} \in \mathbb{R}^{2r}$ for the node-factor pair
$(p, F)$ by concatenating $C^{(1)}(x, p)$ and $C^{(1)}(x, \mathcal{N}_F \setminus p)$.
Finally, we concatenate the node-factor feature vector $z^{(1)}_{pF}$ and the dependent
message feature vector $d_{pF}$ as the input for the second network $C^{(2)}$. Thus the
input dimension for $C^{(2)}$ is $(2r + K)$. For running only one inference iteration, the
input for $C^{(2)}$ is $z^{(1)}_{pF}$ alone. The final output from the second network
$C^{(2)}$ is the $K$-dimensional message vector $z_{pF}$. To sum up, we generate the final
message vector $z_{pF}$ as:

\[
z_{pF} = C^{(2)} \big\{ [\, C^{(1)}(x, p)^\top ;\; C^{(1)}(x, \mathcal{N}_F \setminus p)^\top ;\;
  d_{pF}^\top \,]^\top \big\}. \tag{15}
\]
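
The data flow of Eq. (15) can be sketched in a few lines. This is only an illustration of
how the input of $C^{(2)}$ is assembled: the feature map is filled with random numbers in
place of a real FCNN output, the fully connected layers use random weights, and the sizes
`r`, `K`, `hidden` as well as the helper names are hypothetical.

```python
import random


def message_estimator_input(feat_map, p, neighbors, d_pF=None):
    """Assemble the input of C^(2) in Eq. (15).

    feat_map: N1 x N2 x r nested lists standing in for the output of C^(1)(x).
    p: (row, col) of the variable node; neighbors: the nodes in N_F \\ p.
    d_pF: K-dimensional dependent message feature, or None for a single iteration.
    """
    r = len(feat_map[0][0])
    node_feat = feat_map[p[0]][p[1]]
    avg = [sum(feat_map[i][j][c] for i, j in neighbors) / len(neighbors) for c in range(r)]
    return list(node_feat) + avg + (list(d_pF) if d_pF is not None else [])


def fully_connected(x, weights, biases, relu=True):
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out


random.seed(0)
r, K, hidden = 4, 3, 5
feat_map = [[[random.random() for _ in range(r)] for _ in range(2)] for _ in range(2)]

inp = message_estimator_input(feat_map, p=(0, 0), neighbors=[(0, 1)], d_pF=[0.0] * K)
W1 = [[random.uniform(-1, 1) for _ in range(2 * r + K)] for _ in range(hidden)]
W2 = [[random.uniform(-1, 1) for _ in range(hidden)] for _ in range(K)]
z_pF = fully_connected(fully_connected(inp, W1, [0.0] * hidden), W2, [0.0] * K, relu=False)
```

With `d_pF=None` the input has dimension $2r$, matching the single-iteration case described
above.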







For a general CNN based potential function in conventional CRFs, the potential network
is usually required to have a large number of output units (exponential in the order of the
potentials). For example, it requires $K^2$ outputs ($K$ is the number of classes) for the
pairwise potentials [3]. A large number of output units significantly increases the number
of network parameters, which leads to expensive computation and a tendency to over-fit the
training data. In contrast, for learning our CNN message estimator, we only need to
formulate $K$ output units for the network. Clearly it is more scalable when the number of
classes is large.
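
The difference in output size is easy to quantify. The sketch below counts output units and
the parameters of a final fully connected layer for $K = 21$ (the 20 object categories plus
background of PASCAL VOC); the hidden width of 512 is a hypothetical value chosen only for
the arithmetic.

```python
def output_units(K, order, message_estimator):
    """Number of network outputs: K for a message estimator,
    K**order for a potential function over `order` variables."""
    return K if message_estimator else K ** order


def last_layer_params(hidden, outputs):
    """Weights plus biases of a fully connected output layer."""
    return hidden * outputs + outputs


K = 21        # PASCAL VOC: 20 object categories + background
hidden = 512  # hypothetical width of the last hidden layer

pairwise_potential = last_layer_params(hidden, output_units(K, 2, False))
message = last_layer_params(hidden, output_units(K, 2, True))
```

Under these assumptions the pairwise potential head needs 441 outputs and the message
estimator head 21, so the final layer is 21 times smaller; for triple-cliques the potential
head would need 9261 outputs.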


3.3 Training CNN message estimators 


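Our goal is to estimate the variable marginals in (3), which can be re-written with the
estimators:

\[
P(y_p \mid x) = \sum_{y \setminus y_p} P(y \mid x)
  = \frac{1}{Z_p} \exp \sum_{F \in \mathcal{F}_p} \beta_{F \to p}(y_p)
  = \frac{1}{Z_p} \exp \sum_{F \in \mathcal{F}_p} M_F(x_{pF}, d_{pF}, y_p; \theta_F).
\]

Here $Z_p$ is the normalizer. The ideal variable marginal has probability 1 for the
ground truth class and 0 for the remaining classes. Here we consider the cross entropy loss
between the ideal marginal and the estimated marginal:

\[
\begin{aligned}
J(x, \hat{y}; \theta)
 &= -\sum_{p \in \mathcal{N}} \sum_{y_p = 1}^{K} \delta(y_p = \hat{y}_p) \log P(y_p \mid x; \theta) \\
 &= -\sum_{p \in \mathcal{N}} \sum_{y_p = 1}^{K} \delta(y_p = \hat{y}_p)
    \log \frac{\exp \sum_{F \in \mathcal{F}_p} M_F(x_{pF}, d_{pF}, y_p; \theta_F)}
              {\sum_{y'_p} \exp \sum_{F \in \mathcal{F}_p} M_F(x_{pF}, d_{pF}, y'_p; \theta_F)},
\end{aligned} \tag{16}
\]

in which $\hat{y}_p$ is the ground truth label for the variable node $p$. Given a set of $N$
training images and label masks, the optimization problem for learning the message estimator
network is:

\[
\min_{\theta} \; \frac{\lambda}{2} \|\theta\|_2^2
  + \sum_{i=1}^{N} J(x^{(i)}, \hat{y}^{(i)}; \theta). \tag{17}
\]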
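
The per-image loss is an ordinary softmax cross entropy applied to the summed messages of
each node, which is why no inference step is needed for the gradient. A minimal sketch,
assuming the estimator outputs are already available as plain lists (the numbers below are
made up):

```python
import math


def node_loss(messages, gt_label):
    """Cross-entropy term of the per-image loss for one node p.

    messages: list of K-dimensional estimator outputs M_F(x_pF, d_pF, .),
              one per factor F connected to p.
    gt_label: index of the ground-truth class of p.
    """
    K = len(messages[0])
    scores = [sum(m[k] for m in messages) for k in range(K)]
    mx = max(scores)
    log_z_p = mx + math.log(sum(math.exp(s - mx) for s in scores))
    return -(scores[gt_label] - log_z_p)


def image_loss(all_messages, gt_labels):
    """J(x, y_hat; theta): sum of the node losses over all variable nodes."""
    return sum(node_loss(m, y) for m, y in zip(all_messages, gt_labels))


# Two nodes, each with one unary and one pairwise message estimate.
all_messages = [
    [[2.0, 0.1, -1.0], [0.5, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
]
loss = image_loss(all_messages, gt_labels=[0, 2])
```

A node whose messages are all zero contributes $\log K$, the loss of a uniform prediction.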


The work in [5] proposes to learn the variable-to-factor message ($\beta_{p \to F}$). Unlike
that approach, we aim to learn the factor-to-variable message ($\beta_{F \to p}$), for which
we are able to naturally formulate the variable marginals, which are the ultimate goal for
prediction, as the training objective. Moreover, for learning $\beta_{p \to F}$ in their
approach, the message estimator depends on all neighboring nodes (connected by any
factors). Given that variable nodes have different numbers of neighboring nodes, they only
consider a fixed number of neighboring nodes (e.g., 20) and concatenate their features to
generate a fixed-length feature vector for classification. In our case of learning
$\beta_{F \to p}$, the message estimator depends only on a fixed number of neighboring nodes
(connected by one factor), thus we do not have this problem. Most importantly, they learn
message estimators by training traditional probabilistic classifiers (e.g., simple logistic
regressors) with hand-crafted features; in contrast, we train deep CNNs in an end-to-end
learning style without using hand-crafted features.


3.4 Message learning with inference-time budgets 

One advantage of message learning is that we are able to explicitly incorporate the expected
number of inference iterations into the learning procedure. The number of inference
iterations defines the learning sequence of message estimators. This is particularly useful
if we aim to learn estimators that give high-quality predictions after running only a few
inference iterations. In contrast, conventional potential function learning in CRFs is not
able to directly incorporate the expected number of inference iterations.

We are particularly interested in learning message estimators for use with only one message
passing iteration, for which the inference can be very fast. In this case it might be
preferable to have large-range neighborhood connections, so that long-range interactions
can be captured by running one inference pass.





Table 1: Segmentation results on the PASCAL VOC 2012 “val” set. We compare with
several recent CNN based methods with available results on the “val” set. Our method
performs the best.

method              training set        # train (approx.)   IoU val set
ContextDCRF [3]     VOC extra           10k                 70.3
Zoom-out            VOC extra           10k                 63.5
Deep-struct [2]     VOC extra           10k                 64.1
DeepLab-CRF [2]     VOC extra           10k                 63.7
DeepLab-MCL         VOC extra           10k                 68.7
BoxSup [18]         VOC extra           10k                 63.8
BoxSup [18]         VOC extra + COCO    133k                68.1
ours                VOC extra           10k                 71.1
ours+               VOC extra           10k                 73.3

4 Experiments 

We evaluate the proposed CNN message learning method for semantic image segmentation.
We use the publicly available PASCAL VOC 2012 dataset [19]. There are 20 object categories
and one background category in the dataset. It contains 1464 images in the training
set, 1449 images in the “val” set and 1456 images in the test set. Following the common
setting, the training set is augmented to 10582 images by including the extra annotations
provided in [21] for the VOC images. We use the intersection-over-union (IoU) score
to evaluate the segmentation performance. For the learning and prediction of our method,
we only use one message passing iteration.
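
For reference, the IoU score can be sketched as below. The sketch assumes the usual
per-class definition (pixels where prediction and ground truth both equal the class, divided
by pixels where either does), averaged over classes; the official VOC evaluation accumulates
these counts over the whole dataset and has additional rules (e.g., for ignored pixels)
that are not modelled here.

```python
def mean_iou(pred, gt, num_classes):
    """Mean over classes of intersection / union for flat lists of pixel labels.
    Classes absent from both the prediction and the ground truth are skipped."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious)


pred = [0, 0, 1, 1, 2, 2]
gt = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, gt, num_classes=3)
```

In this toy example the per-class scores are 1/3, 2/3 and 1/2, giving a mean of 0.5.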

The recent work in [3] (referred to as ContextDCRF) learns multi-scale fully convolutional
CNNs (FCNNs) for unary and pairwise potential functions to capture contextual information.
We follow this CRF learning method and replace the potential functions by the
proposed message estimators. We consider 2 types of spatial relations for constructing the
pairwise connections of variable nodes. One is the “surrounding” spatial relation, for which
one node is connected to its surrounding nodes. The other one is the “above/below” spatial
relation, for which one node is connected to the nodes that lie above. For the pairwise
connections, the neighborhood size is defined by a range box. We learn one type of unary
message estimator and 3 types of pairwise message estimators in total. One type of pairwise
message estimator is for the “surrounding” spatial relations, and the other two are for
the “above/below” spatial relations. We formulate one network for each type of message
estimator.
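
The pairwise connections can be pictured with the following sketch of a range box on the
feature-map grid. The exact box shapes and sizes are not specified here, so the square
centered box for “surrounding”, the box covering the rows directly above for “above”, and
the parameter `half` are all assumptions made for illustration.

```python
def range_box_neighbors(p, grid, half, relation):
    """Nodes connected to p = (row, col) on a grid of shape (rows, cols).

    relation "surrounding": box of side 2 * half + 1 centered on p (p excluded).
    relation "above": box of the same width covering the `half` rows above p.
    """
    rows, cols = grid
    r0, c0 = p
    if relation == "surrounding":
        row_range = range(r0 - half, r0 + half + 1)
    elif relation == "above":
        row_range = range(r0 - half, r0)
    else:
        raise ValueError(relation)
    return [
        (r, c)
        for r in row_range
        for c in range(c0 - half, c0 + half + 1)
        if 0 <= r < rows and 0 <= c < cols and (r, c) != p
    ]


surround = range_box_neighbors((2, 2), grid=(5, 5), half=1, relation="surrounding")
above = range_box_neighbors((2, 2), grid=(5, 5), half=1, relation="above")
```

Nodes near the image border simply have fewer neighbors, since positions outside the grid
are discarded.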

We formulate our message estimators as multi-scale FCNNs, for which we apply a similar
network configuration as in [3]. The network $C^{(1)}$ (see Sec. 3.2 for details) has 6
convolution blocks and $C^{(2)}$ has 2 fully connected layers (with $K$ output units). Our
networks are initialized using the VGG-16 model [22]. We train all layers using
back-propagation.

We first evaluate our method on the VOC 2012 “val” set. We compare with several recent
CNN based methods with available results on the “val” set. Results are shown in Table 1.
Our method achieves the best performance. As mentioned, ContextDCRF learns CNN
based potential functions in CRFs to capture contextual information. ContextDCRF follows
a conventional CRF learning and prediction scheme: it first learns potentials and then
performs inference based on the learned potentials to output final predictions. The result
shows that learning the CNN message estimators is able to achieve similar performance
compared to learning CNN potential functions in CRFs. Note that since we only use one
message passing iteration for training and prediction, the inference time cost is almost
negligible. Hence our method enables much more efficient inference.

To further improve the performance, we perform simple data augmentation in training. We
generate 4 extra scales ([0.8, 0.9, 1.1, 1.2]) of the training images and their flipped
images for training. This result is denoted by “ours+” in the result table.

We further evaluate our method on the VOC 2012 test set. We compare with recent
state-of-the-art CNN methods with competitive performance. The results are shown in Table 3.
Since the ground truth labels are not available for the test set, we evaluate our method
through the VOC evaluation server. We achieve impressive performance on the test set:

Table 2: Category results (IoU) on the PASCAL VOC 2012 test set. Our method achieves the
best mean score.

method       mean   aero   bike   bird   boat   bottle bus    car    cat    chair  cow    table  dog    horse  mbike  person potted sheep  sofa   train  tv
DeepLab-CRF  66.4   78.4   33.1   78.2   55.6   65.3   81.3   75.5   78.6   25.3   69.2   52.7   75.2   69.0   79.1   77.6   54.7   78.3   45.1   73.3   56.2
DeepLab-MCL  71.6   84.4   54.5   81.5   63.6   65.9   85.1   79.1   83.4   30.7   74.1   59.8   79.0   76.1   83.2   80.8   59.7   82.2   50.4   73.1   63.7
FCN-8s       62.2   76.8   34.2   68.9   49.4   60.3   75.3   74.7   77.6   21.4   62.5   46.8   71.8   63.9   76.5   73.9   45.2   72.4   37.4   70.9   55.1
CRF-RNN      72.0   87.5   39.0   79.7   64.2   68.3   87.6   80.8   84.4   30.4   78.2   60.4   80.5   77.8   83.1   80.6   59.5   82.8   47.8   78.3   67.1
ours         73.4   90.1   38.6   77.8   61.3   74.3   89.0   83.4   83.3   36.2   80.2   56.4   81.2   81.4   83.1   82.9   59.2   83.4   54.3   80.6   70.8

Table 3: Segmentation results on the PASCAL VOC 2012 test set. Compared to methods
that use the same augmented VOC dataset, our method has the best performance.

method               training set        # train (approx.)   IoU test set
ContextDCRF [3]      VOC extra           10k                 70.7
Zoom-out             VOC extra           10k                 64.4
FCN-8s               VOC extra           10k                 62.2
SDS                  VOC extra           10k                 51.6
DeconvNet-CRF        VOC extra           10k                 72.5
DeepLab-CRF          VOC extra           10k                 66.4
DeepLab-MCL          VOC extra           10k                 71.6
CRF-RNN              VOC extra           10k                 72.0
DeepLab-CRF [24]     VOC extra + COCO    133k                70.4
DeepLab-MCL [24]     VOC extra + COCO    133k                72.7
BoxSup (semi) [18]   VOC extra + COCO    133k                71.0
CRF-RNN              VOC extra + COCO    133k                74.7
ours                 VOC extra           10k                 73.4

73.4 IoU score, which is the best performance so far among the methods that use the
same augmented VOC dataset [21] (marked as “VOC extra” in the table). These results
validate the effectiveness of direct message learning with CNNs.

We also include a comparison with methods which are trained on the much larger
COCO dataset (around 133K training images). Our performance is comparable with these
methods, while our method uses far fewer training images.

The results for each category are shown in Table 2. We compare with several recent methods
which transfer layers from the same VGG-16 model and use the same training data. Our
method performs the best for most categories.


5 Conclusion 

We have proposed a new deep message learning framework for structured CRF prediction.
Learning deep message estimators for message passing inference reveals a new direction
for learning deep structured models. Learning CNN message estimators is efficient, since it
does not involve expensive inference steps for gradient calculation. The network output
dimension for message estimation is the same as the number of classes and does not
increase with the order of the potentials; thus CNN message learning has fewer network
parameters and is more scalable in the number of classes than conventional potential
function learning. Our impressive performance for semantic segmentation demonstrates the
effectiveness and usefulness of the proposed deep message learning. Our framework is general
and can be readily applied to other structured prediction applications.